Actor-independent action search using spatiotemporal vocabulary with appearance hashing

نویسندگان

  • Rongrong Ji
  • Hongxun Yao
  • Xiaoshuai Sun
چکیده

Human actions in movies and sitcoms usually capture semantic cues for story understanding, which offer a novel search pattern beyond the traditional video search scenario. However, there are great challenges to achieve action-level video search, such as global motions, concurrent actions, and actor appearance variances. In this paper, we introduce a generalized action retrieval framework, which achieves fully unsupervised, robust, and actor-independent action search in large-scale database. First, an Attention Shift model is presented to extract human-focused foreground actions from videos containing global motions or concurrent actions. Subsequently, a spatiotemporal vocabulary is built based on 3D-SIFT features extracted from these human-focused action regions. These 3D-SIFT features offer robustness against rotations and viewpoints. And the spatiotemporal vocabulary guarantees our search efficiency, which is achieved by inverted indexing structure with approximate nearest-neighbor search. In the online ranking, we employ dynamic time warping distance to handle the action duration variances, as well as partial action matching. Finally, an appearance hashing strategy is presented to address the performance degeneration caused by divergent actor appearances. For experimental validation, we have deployed actor-independent action retrieval framework in 3-season ‘‘Friends’’ sitcoms (over 30h). In this database, we have reported the best performance ðMAP@140:53Þ with comparisons to alternative and state-of-the-art approaches. Crown Copyright & 2010 Published by Elsevier Ltd. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Learning Vocabulary-Based Hashing with AdaBoost

Approximate near neighbor search plays a critical role in various kinds of multimedia applications. The vocabulary-based hashing scheme uses vocabularies, i.e. selected sets of feature points, to define a hash function family. The function family can be employed to build an approximate near neighbor search index. The critical problem in vocabulary-based hashing is the criteria of choosing vocab...

متن کامل

Action recognition with appearance-motion features and fast search trees

In this paper we propose an approach for action recognition based on a vocabulary of local motion-appearance features and fast approximate search in a large number of trees. Large numbers of features with associated motion vectors are extracted from video data and are represented by many trees. Multiple interest point detectors are used to provide features for every frame. The motion vectors fo...

متن کامل

A Practical Dramaturgy for Actors through Theatrical Production Procedures

Dramaturgy, as a creative and critical act dependent on the theoretical and practical knowledge of theater, consists of two parts with Greek roots: Drama (action) and Ourgia (work and operation). Ourgia literally means to practice, act, and in other words process a raw material. Eugenio Barba divides Dramaturgy in to three parts: Actor Dramaturgy, Director Dramaturgy and Audience Dramaturgy. Th...

متن کامل

Clustering is Efficient for Approximate Maximum Inner Product Search

Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another ...

متن کامل

Guess Where? Actor-Supervision for Spatiotemporal Action Localization

This paper addresses the problem of spatiotemporal localization of actions in videos. Compared to leading approaches, which all learn to localize based on carefully annotated boxes on training video frames, we adhere to a weakly-supervised solution that only requires a video class label. We introduce an actor-supervised architecture that exploits the inherent compositionality of actions in term...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Pattern Recognition

دوره 44  شماره 

صفحات  -

تاریخ انتشار 2011